Real Estate Alpha Calculator: 
A Tool for Assessing Risk-Adjusted Returns in Residential Real 
Estate Investing 


David Latortue 


Abstract 


This paper presents the Real Estate Alpha Calculator, a tool designed to aid real estate 
investors by providing a quantitative assessment of investment opportunities. By calculating 
an "alpha" value for properties in Montreal, the tool helps investors evaluate risk-adjusted 
returns. The calculator leverages machine learning models for price prediction and adapts 
the Capital Asset Pricing Model (CAPM) to real estate, offering a data-driven approach to 
streamline investment decision-making. Our results demonstrate its effectiveness in comparing 
properties and identifying high-potential investments while minimizing systematic risks. 


1 Introduction 


Real estate investment involves navigating a competitive landscape where identifying profitable 
opportunities is challenging. The idiosyncratic nature of properties, combined with complex, 
obscure or incomplete market information, makes screening potential investments both costly 
and time-consuming. Currently, investors lack an efficient solution for evaluating relative returns 
while properly accounting for associated risks, hindering their ability to effectively filter and 
prioritize investment opportunities. 


This paper introduces the Real Estate Alpha Calculator, a screening tool designed to help investors 
assess the expected returns and systematic risks of potential real estate investments in Montreal. 
The calculator allows investors to systematically evaluate and compare prospective properties, 
identifying those offering the best risk-adjusted potential returns. This enables investors to 
quickly focus on a targeted set of high-potential properties, saving them from the burden of 
time-consuming extensive market research. 

To achieve this, we implemented the following steps: 


1. Developed comprehensive neighborhood profiles to capture factors influencing property 
values. 


2. Compared properties within similar categories to ensure fair evaluation. 


3. Assigned expected return and risk ratings, enabling a quantitative assessment of investment 
potential. 


Our tool computes an alpha (α) value to quantify the excess returns of a given property, bridging 
financial modeling and practical real estate investment decisions. Alpha measures an asset’s 
potential performance relative to its peers, offering a data-driven approach to initial screening 
that focuses on systematic risk factors. 
Paper Structure 
The remainder of this paper is organized as follows: 
• Background: Theoretical foundations in finance and machine learning 
• Methodology: Data processing, risk calculation, price prediction, and alpha computation 
• Results: Model outcomes, tool demonstration, and statistical analysis 


2 Background: A little bit of theory 


2.1 Alpha (α) 


Alpha (which we will refer to as α) is a measure used in finance to assess the performance of 
an investment relative to the market it is part of. α tells you if the actual return of the target 
investment is above or below where it should be, relative to the level of systematic risk involved 
(estimated by its beta) and the overall market conditions [11]. Here is how alpha is calculated: 


$$\alpha = R_i - \left[ R_f + \beta_i \left( R_m - R_f \right) \right] \tag{1}$$
where: 
• $R_i$ is the actual return of the investment. 
• $R_f$ is the risk-free rate (more on this later). 
• $\beta_i$ is the beta of the investment, which measures its volatility relative to the market (also 
more on that later). 
• $R_m$ is the expected return of the market. 


2.1.1 Interpretation of α 


Once α is calculated, you obtain a real value that can be either positive or negative. Here’s 
how to interpret it: 


• Positive Alpha (α > 0): A positive alpha indicates that the investment has outperformed 
the market on a risk-adjusted basis. It suggests that the investment generated returns 
higher than expected given its level of risk. 


• Zero Alpha (α = 0): A zero alpha means that the investment has performed exactly as 
expected based on its risk level. In other words, it neither outperformed nor underperformed 
the market. In practice, to determine if an investment aligns with market expectations, we 
set bounds around zero. 


• Negative Alpha (α < 0): A negative alpha suggests that the investment has underperformed 
the market on a risk-adjusted basis. The investment’s returns were lower than 
expected given its risk, indicating poor performance relative to the benchmark. 
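
As a minimal sketch of Equation (1) and the interpretation above, the following Python snippet computes α and labels it using a band around zero. The function names, the sample inputs, and the 0.005 tolerance are illustrative assumptions, not values taken from the calculator.

```python
def capm_alpha(actual_return, risk_free_rate, beta, market_return):
    """Alpha as in Equation (1): actual return minus the CAPM-expected return."""
    expected_return = risk_free_rate + beta * (market_return - risk_free_rate)
    return actual_return - expected_return


def interpret_alpha(alpha, tolerance=0.005):
    """Label alpha using bounds around zero (the tolerance is a hypothetical choice)."""
    if alpha > tolerance:
        return "outperformed"
    if alpha < -tolerance:
        return "underperformed"
    return "in line with expectations"


# Hypothetical inputs: 8% actual return, 3% risk-free rate, beta of 1.2, 6% market return.
alpha = capm_alpha(actual_return=0.08, risk_free_rate=0.03, beta=1.2, market_return=0.06)
print(round(alpha, 4), interpret_alpha(alpha))
```

The tolerance plays the role of the "bounds around zero" mentioned above; a wider band classifies more investments as performing in line with expectations.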


2.2 Risk (β) 


Beta is a measure of the risk of an asset relative to the market. It captures the relative 
volatility of the asset compared to the market. Volatility is simply the variability in the pricing 
of an asset [12]. The wider the gaps between the highs and lows of an asset’s price, the more 
volatile it is. To measure the risk, we first compute the standard deviation (often referred to 
as σ) of the price history of the asset and of its market, as follows: 
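
In the notation defined just below, the standard deviation of returns is commonly written as shown here. The N − 1 denominator (sample form) is an assumption; the population form divides by N instead.

```latex
\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(R_i - \bar{R}\right)^2}
```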
where: 
• σ is the volatility (standard deviation of returns). 
• N is the number of periods (e.g., days, months, years). 
• $R_i$ is the return of the asset in period i. 
• $\bar{R}$ is the average return of the asset over N periods. 


Then, we need to know whether the asset and the market are correlated in any way, and whether they 
move in the same direction or in opposite directions. The correlation between two variables X and Y 
is calculated as: 

$$\mathrm{Corr}(X, Y) = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}}$$


where: 
• $X_i$ and $Y_i$ are the individual data points of X and Y. 
• $\bar{X}$ and $\bar{Y}$ are the means (averages) of X and Y, respectively. 
• Σ denotes summation over all data points. 


The output of Corr is a value between −1 and 1, interpreted as follows: 


• A value of 1 indicates a perfect positive linear relationship. 
• A value of −1 indicates a perfect negative linear relationship. 
• A value of 0 indicates no linear relationship between the variables. 


Finally, to calculate β given the price history of the asset A and the market M, we compute: 


$$\beta(A, M) = \mathrm{Corr}(A, M) \cdot \frac{\sigma(A)}{\sigma(M)}$$


Ideally, you would want the lowest risk possible on an investment. However, in finance, there is a 
fundamental trade-off between risk and reward: higher risk generally comes with higher potential returns. 


Interpretation of β 
• β > 1: More volatile than the market (higher risk, potentially higher returns) 
• β = 1: Same risk level as the market 
• 0 < β < 1: Lower risk than the market (steady returns) 
• β = 0: Uncorrelated with the market 
• β < 0: Negatively correlated with the market 
Figure 1: Comparison of volatility in pricing examples. 
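
The β computation described in this subsection can be sketched in plain Python as correlation times the ratio of standard deviations. The series below are made-up return values, and the sample (N − 1) form of the standard deviation is an assumption; it cancels in the ratio in any case.

```python
import math


def mean(values):
    return sum(values) / len(values)


def correlation(x, y):
    """Pearson correlation between two equally long series."""
    x_bar, y_bar = mean(x), mean(y)
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = math.sqrt(sum((xi - x_bar) ** 2 for xi in x)) * math.sqrt(
        sum((yi - y_bar) ** 2 for yi in y)
    )
    return numerator / denominator


def std_dev(values):
    """Sample standard deviation (N - 1 denominator, an assumption)."""
    v_bar = mean(values)
    return math.sqrt(sum((v - v_bar) ** 2 for v in values) / (len(values) - 1))


def beta(asset, market):
    """Beta = Corr(asset, market) * sigma(asset) / sigma(market)."""
    return correlation(asset, market) * std_dev(asset) / std_dev(market)


# Hypothetical data: the asset moves in lockstep with the market, but twice as far.
market = [0.010, -0.020, 0.015, 0.005, -0.010]
asset = [2 * r for r in market]
print(round(beta(asset, market), 6))
```

An asset that mirrors the market with twice the amplitude lands in the β > 1 case of the interpretation list, while flipping the sign of every asset return would give a negative β.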


2.2.1 Risk-Free rate 


The risk-free rate is the theoretical return an investor can expect from an investment with no risk 
of financial loss. It represents the minimum return required for an investment that carries no risk 
of default, credit, or inflation impact. In practice, the risk-free rate is typically represented by 
the yield on short-term government-issued securities, such as Canadian Treasury bills (T-bills), 
because these are backed by the Canadian government’s full faith and credit, making them highly 
secure and virtually free from default risk. If an investment opportunity has any risk but offers 
returns lower than the risk-free rate, it’s generally wise to avoid it, as you’re not being adequately 
compensated for taking on additional risk. In our case, we use the policy interest rate 
published by the Bank of Canada (Bank of Canada Key Interest Rate). 


2.3 Alpha (α) for Real Estate (RE) 


Real estate offers various investment opportunities, including Real Estate Investment Trusts 
(REITs), house flipping, and commercial properties. This discussion will focus on buying and 
selling residential properties. In this domain, investors use several metrics to evaluate different 
aspects of a real estate investment. Here are a few key metrics: 


• Net Operating Income (NOI): A measure of the profitability of an investment, calculated 
as Gross Operating Income (revenues) − Operating Expenses. 


• Cash on Cash Return: The ratio of annual pre-tax cash flow to total cash invested, 
calculated as Annual Pre-Tax Cash Flow / Total Cash Invested. 


• Cap Rate: The capitalization rate, calculated as NOI / Property Value. 


• Purchase Price: The price paid for the property. 
• Property Value: The market value of the property. 


• Internal Rate of Return (IRR): An important metric not specific to real estate, 
representing the annualized rate of return. 
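
For illustration, the first three metrics can be computed as in the short Python sketch below. The cap rate is taken as NOI divided by property value, the usual definition, and all dollar amounts are made-up examples.

```python
def net_operating_income(gross_operating_income, operating_expenses):
    """NOI = gross operating income (revenues) minus operating expenses."""
    return gross_operating_income - operating_expenses


def cash_on_cash_return(annual_pre_tax_cash_flow, total_cash_invested):
    """Annual pre-tax cash flow relative to the cash actually invested."""
    return annual_pre_tax_cash_flow / total_cash_invested


def cap_rate(noi, property_value):
    """Capitalization rate: NOI relative to the property's market value."""
    return noi / property_value


# Hypothetical property: $48,000 in yearly revenues and $18,000 in operating expenses.
noi = net_operating_income(48_000, 18_000)
property_cap_rate = cap_rate(noi, property_value=600_000)
property_coc = cash_on_cash_return(annual_pre_tax_cash_flow=12_000, total_cash_invested=150_000)
print(noi, property_cap_rate, property_coc)
```

Note that the cap rate ignores financing, whereas the cash-on-cash return depends on how much cash the investor puts in.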


Traditionally, alpha is used in public financial markets, but it can also be applied to real estate. 
To adapt alpha for real estate, we must address several considerations. First, individual properties 
often lack significant price histories because homeowners and residential real estate investors 
typically hold assets for long periods, and buying and selling is slow and costly (legal fees, 
inspections, insurance, taxes, appraisal, etc.). To overcome this challenge, we propose defining 
criteria to identify properties similar to a given one, thereby establishing a comparative price 
history. We refer to this definition as the Property Class ($P_C$). The definition is shown in the 
following formula: 


$$P_C = \{\, p_i \mid p_i \in \mathrm{Prices},\ \mathrm{Market}_{\mathrm{Columns}}[i] = \mathrm{Criteria} \,\} \tag{3}$$
where: 


• $P_C$: Set of property prices from the market that match the specified criteria. 
• Prices: Vector of all property prices available in the market. 


• Market: Matrix where each row represents a property and each column represents a feature 
(e.g., 'neighborhood', 'propertyType', etc.). 


• Columns: List of columns in Market used to filter properties, corresponding to ['neighborhood', 
'propertyType', 'totalUnits', 'residentialUnits', 'businessUnits']. 


• Criteria: Vector of specific values for the columns, representing the desired attributes of 
the property. 


• i: Index of properties in Market where the columns match the criteria. 
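
A minimal sketch of the Property Class filter in Equation (3), written in plain Python. The listings are made up, and storing each row as a dictionary with a 'price' key is an assumption about the data layout, not the structure used by the calculator.

```python
COLUMNS = ["neighborhood", "propertyType", "totalUnits", "residentialUnits", "businessUnits"]


def property_class_prices(market, criteria, columns=COLUMNS):
    """Return the prices of all listings whose `columns` values match `criteria`."""
    return [
        row["price"]
        for row in market
        if all(row[col] == criteria[col] for col in columns)
    ]


# Hypothetical market rows (one dictionary per listing).
market = [
    {"neighborhood": "Rosemont", "propertyType": "Condo", "totalUnits": 1,
     "residentialUnits": 1, "businessUnits": 0, "price": 420_000},
    {"neighborhood": "Rosemont", "propertyType": "Condo", "totalUnits": 1,
     "residentialUnits": 1, "businessUnits": 0, "price": 455_000},
    {"neighborhood": "Rosemont", "propertyType": "Multiplex", "totalUnits": 3,
     "residentialUnits": 3, "businessUnits": 0, "price": 980_000},
    {"neighborhood": "Verdun", "propertyType": "Condo", "totalUnits": 1,
     "residentialUnits": 1, "businessUnits": 0, "price": 390_000},
]

criteria = {"neighborhood": "Rosemont", "propertyType": "Condo", "totalUnits": 1,
            "residentialUnits": 1, "businessUnits": 0}

class_prices = property_class_prices(market, criteria)
print(class_prices)
```

Dropping 'neighborhood' from the column list would give the city-wide comparison set for the same kind of property.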


For a given property P, we find the matches of its property class $P_C$ and use them as a price history 
to estimate the risk of our class, $\beta_C$. This allows us to circumvent the statistical limitations that 
come from the low observed volatility and limited price history of each individual residential real 
estate asset. We provide more details on this approach in the Methodology section. 

The β of property class $P_C$ against the market $M_C$ can be expressed as follows: 


$$\beta_C = \beta(P_C, M_C)$$


We opt to use a machine learning approach to estimate the expected returns from a property. 
More on this is detailed in the Methodology section. 


2.3.1 Final formula α 


To apply α to real estate, we need to define the actual return on investment $R_i$ and the expected 
market return $R_m$. For $R_i$, we can use metrics such as the capitalization rate (cap rate), internal 
rate of return (IRR), or cash-on-cash return. We choose the cap rate due to its simplicity and 
its independence from an individual’s financing strategy. For $R_m$, we use machine learning 
models trained on our Property Sales Listing and Renting datasets. These models, denoted as 
$f_p(P)$ and $f_r(P)$ respectively, where P represents the property, predict the property prices and 
rental incomes to estimate the expected return. Thus, the final formula we will use is: 


$$\alpha_{RE} = \frac{\mathrm{NOI}}{\mathrm{PurchasePrice}} - \left[ R_f + \beta_C \left( \frac{f_r(P)}{f_p(P)} - R_f \right) \right] \tag{4}$$


The methodology section will show details of the algorithm used in practice to find α. 
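
As a rough sketch of how Equation (4) could be evaluated, the snippet below takes the property's cap rate as the actual return and the ratio of predicted rental income to predicted price as the expected market return. Treating the rental prediction as a yearly amount is an assumption, as are the function name and all sample values.

```python
def real_estate_alpha(noi, purchase_price, risk_free_rate, beta_c,
                      predicted_annual_rent, predicted_price):
    """Alpha for real estate: cap rate minus the CAPM-style expected return."""
    actual_return = noi / purchase_price  # cap rate of the property
    # Assumption: market return is predicted yearly rent over predicted price.
    market_return = predicted_annual_rent / predicted_price
    expected_return = risk_free_rate + beta_c * (market_return - risk_free_rate)
    return actual_return - expected_return


# Hypothetical property: $30,000 NOI on a $500,000 purchase, 4% risk-free rate,
# class beta of 0.5, and model predictions of $27,500 yearly rent and $550,000 price.
alpha_re = real_estate_alpha(
    noi=30_000,
    purchase_price=500_000,
    risk_free_rate=0.04,
    beta_c=0.5,
    predicted_annual_rent=27_500,
    predicted_price=550_000,
)
print(round(alpha_re, 4))
```

With these inputs the cap rate is 6% and the expected return 4.5%, so the property shows a positive alpha of about 1.5 percentage points.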


2.4 Machine Learning Concepts 


As mentioned in the previous section, to estimate the expected rent and property value of a 
property in the market ($V_r$ and $V_p$), we propose training machine learning regression models 
on data from various web listings. The estimates of the rent and sale price 
of the property P are expressed as follows, using $f_r(P)$ and $f_p(P)$: 
$$V_r = f_r(P), \qquad V_p = f_p(P) \tag{5}$$


In this subsection, we explore various machine learning concepts and models that are widely 
used for regression tasks, particularly when dealing with tabular data. The models range 
from traditional non-deep learning estimators to modern deep learning approaches, including 
transformers specifically tailored for tabular data. 


2.4.1 Non-Deep Learning Estimators 


Gradient Boosting Models 

These models build trees sequentially, with each new tree correcting errors made by the previous 
ones. They combine weak learners (shallow trees) to create a strong predictor, optimizing a loss 
function at each step [6]. We trained the following models: the Gradient Boosting 
Regressor from Scikit-Learn, along with CatBoost [15], LightGBM [13], XGBoost [3], and NGBoost [5]. 


Random Forest 

Random Forest creates multiple decision trees using random subsets of features and data samples 
[1]. It then aggregates predictions from all trees, typically using majority voting for classification 
or averaging for regression. We trained RandomForest from Scikit-Learn. 


Linear Models 
These models assume a linear relationship between features and the target variable. They fit a 
linear function to the data, with different regularization techniques to prevent overfitting: 


• Ridge [9]: Uses L2 regularization 
• Lasso [18]: Uses L1 regularization 
• ElasticNet [21]: Combines L1 and L2 regularization 


Support Vector Machines 

Support Vector Regression (SVR) fits a function that keeps as many data points as possible within 
a tolerance margin (ε) around its predictions, penalizing only the points that fall outside it [4]. 
It can handle non-linear relationships by using kernel functions to transform the input space. 
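
To make the comparison of these estimator families concrete, here is a small Scikit-Learn sketch that cross-validates one model from each family. It runs on synthetic data from `make_regression`, not on the paper's datasets, and the hyperparameters are arbitrary, so the scores say nothing about the results reported later.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic tabular regression problem standing in for real listing data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "ridge": Ridge(alpha=1.0),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler.
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0)),
}

# Mean 5-fold cross-validated R^2 for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Because the synthetic target is linear in the features, the linear model has an advantage here that it would not necessarily have on real estate listings.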


2.4.2 Deep Learning Models for Tabular Data 


Transformers: Originally developed for Natural Language Processing (NLP) tasks, transformers 
have been adapted for tabular data[19]. They are adept at learning complex dependencies and 
feature interactions but typically require substantial amounts of data and careful tuning. 


Tab Transformer: The Tab Transformer is a specialized variant of the transformer model 
designed for tabular data[10]. It efficiently captures feature interactions without requiring 
extensive preprocessing, making it particularly effective for datasets with categorical variables. 


FT Transformer: The FT Transformer builds upon the Tab Transformer by extending its 
capability to handle both categorical and numerical features [7]. Instead of encoding only 
the categorical variables, the FT Transformer incorporates numerical features directly into the 
transformer architecture. 
ResNet: ResNet is a powerful Convolutional Neural Network that has proven to be particularly 
effective in computer vision tasks [8]. However, researchers have shown that it can also perform 
well on tabular tasks [20]. 


2.4.3 Ensembling Techniques 


Ensembling techniques are commonly used to improve model performance by combining predictions 
from multiple models. Key ensembling methods include: 


• Stacking and Blending: Stacking involves combining different types of models, such as 
linear models with tree-based models, to leverage their complementary strengths. 


• Voting: Voting methods aggregate the predictions from multiple models to make a final 
prediction. Common strategies include majority voting for classification tasks and averaging 
for regression tasks. 


• Averaging / Weighted Averaging: Averaging involves combining predictions from multiple 
models by calculating their average. This technique is often used in ensemble methods 
to improve the robustness of predictions. Additionally, the contribution of each model can 
be weighted to improve performance. 
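
The sketch below illustrates weighted averaging and stacking with Scikit-Learn's `VotingRegressor` and `StackingRegressor`. The data is synthetic and the base models and weights are arbitrary choices for illustration; they are not the ensemble configuration used in this work.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data standing in for real listings.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    ("ridge", Ridge(alpha=1.0)),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
]

# Weighted averaging: the final prediction is a weighted mean of the base predictions.
voting = VotingRegressor(estimators=base_models, weights=[2, 1]).fit(X_train, y_train)

# Stacking: a final estimator learns how to combine the base predictions.
stacking = StackingRegressor(estimators=base_models, final_estimator=Ridge()).fit(
    X_train, y_train
)

voting_r2 = voting.score(X_test, y_test)
stacking_r2 = stacking.score(X_test, y_test)
print(f"voting R^2 = {voting_r2:.3f}, stacking R^2 = {stacking_r2:.3f}")
```

The difference between the two is who sets the weights: the user does in weighted averaging, while stacking fits them from out-of-fold predictions.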


While traditional machine learning models often outperform deep learning methods on tabular 
data, exploring deep learning approaches, particularly transformer-based models, can be valuable. 
This exploration is especially pertinent given the increasing complexity of tabular datasets and 
the potential of transformers to capture intricate feature interactions. 


3 Methodology 


3.1 Data Collection 


The foundation of our Real Estate Alpha Calculator is a comprehensive dataset of property sales 
and apartment rentals across Quebec. This section details our data collection process, sources, 
and the scope of our initial dataset. 


3.1.1 Data Sources 


We employed web scraping techniques to gather data from multiple prominent real estate platforms 
in Quebec. The primary sources for our data collection were: 


• LesPacs 
• Duproprio 
• Centris 


These platforms were chosen for their extensive coverage of the Quebec real estate market, 
ensuring a diverse and representative dataset. 


3.1.2 Web Scraping Methodology 


To efficiently collect data from these sources, we utilized Scrapy, a powerful and flexible web 
scraping framework. Our scraping process began in 2020, allowing us to accumulate a substantial 
amount of historical data. 
3.1.3 Data Fields 


Our data collection focused on two main categories: property sales and apartment rentals. For 
each category, we collected a comprehensive set of fields to provide a detailed picture of each 
listing. 

Property Sales Data Fields 

For property sales, we collected the following 26 fields: 


'property_id', 'source', 'address', 'neighborhood', 'price', 'property_type', 
'usage', 'construction_date', 'building_configuration', 'potential_gross_revenues', 
'building_style', 'no_of_units', 'lots_area_in_sqr_ft', 'no_of_parkings', 
'parking_type', 'description', 'title', 'url', 'creation_date', 'has_pool', 
'pool_type', 'fireplace_or_stove_type', 'adapted_for_reduced_mobility', 
'elevator', 'close_to_body_of_water', 'price_in_dollars' 


Apartment Rentals Data Fields 
For apartment rentals, we collected the following 29 fields: 


neighborhood, no_of_rooms, no_of_bathrooms, category, electricity_included, 
heating_included, wifi_included, parking_included, animal_friendly, 
laundry_in_unit, air_conditioner_included, private_exterior_spaces_included, 
smoking_allowed_included, furnished, lease_duration_in_months, 
details_from_description, water_included, tv_included, laundry_in_building, 
dishwasher, fridge, gym, pool, concierge, security24hrs, bicycle_parking, 
storage_space, elevator, area 


3.1.4 Dataset Size 


Our initial dataset, prior to preprocessing, comprised: 


• Property Sales: 159,346 listings 
• Apartment Rentals: 59,242 listings 


This substantial dataset provides a robust foundation for our analysis, offering a comprehensive 
view of the Quebec real estate market over time. 


3.1.5 Data Collection Challenges and Limitations 
While our data collection process was extensive, it’s important to note some potential limitations: 
• Data availability is limited to listings posted from 2020 onward. 


• The accuracy of the data depends on the information provided by sellers and landlords on 
the source websites. 


• There may be some inconsistencies or missing data across different platforms due to varying 
listing formats and requirements. 


These limitations were addressed during our data preprocessing and cleaning stages, which will 
be discussed in the subsequent section. 
3.2 Data Preprocessing 


The raw data collected from various real estate platforms underwent a rigorous preprocessing 
pipeline to ensure its quality, consistency, and analytical value. This process was crucial for both 
the property sales dataset (159,346 listings) and the rental dataset (59,242 listings). 

Our preprocessing pipeline consisted of the following key steps: 


1. Geographical Filtering: We filtered listings to include only properties within Montreal, 
utilizing address and neighborhood information. This step ensured our analysis remained 
focused on our target market. 


2. Neighborhood Standardization: We developed and applied a custom Montreal neighborhood 
classification system. This standardized location data across listings, allowing for 
consistent geographical analysis. 


3. Property Type Categorization: Properties were classified into standardized categories 
(e.g., Apartment, Condo, Duplex/Triplex). This categorization facilitates market segment 
analysis and ensures comparability across listings. 


4. Feature Engineering: We transformed raw data fields into analytically useful features. 
This process included: 


• Extracting numerical data from text descriptions 
• Standardizing area measurements to square feet 
• Creating binary indicators for amenities (e.g., parking, pool, elevator) 
• Deriving additional features such as property age and unit counts 


5. Data Cleaning: We identified and removed outliers and erroneous entries using domain-specific 
heuristics. This step ensured data integrity and removed potentially misleading 
data points. 


6. Missing Value Handling: Where appropriate, we estimated missing values using data 
from similar properties. This approach enhanced dataset completeness while maintaining 
data integrity. 


7. Data Transformation: Certain numerical fields underwent Box-Cox transformation to 
address skewness, improving their suitability for subsequent statistical analyses. 


8. Encoding Categorical Variables: We encoded categorical variables numerically to 
prepare them for machine learning algorithms. 
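
Steps 7 and 8 can be sketched with Scikit-Learn as follows. The prices (in thousands of dollars) and property types are made up, and one-hot encoding is only one possible choice, since the encoding scheme is not specified above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

# Hypothetical right-skewed prices, in thousands of dollars.
prices_k = np.array([[250.0], [320.0], [410.0], [1500.0], [380.0]])
property_types = np.array([["Condo"], ["Duplex/Triplex"], ["Condo"], ["Apartment"], ["Condo"]])

# Box-Cox requires strictly positive inputs; lambda is estimated from the data.
box_cox = PowerTransformer(method="box-cox", standardize=False)
prices_transformed = box_cox.fit_transform(prices_k)

# One-hot encoding (an assumed scheme): one binary column per category.
encoder = OneHotEncoder(handle_unknown="ignore")
types_encoded = encoder.fit_transform(property_types).toarray()

print("fitted lambda:", box_cox.lambdas_[0])
print("categories:", list(encoder.categories_[0]))
print(types_encoded)
```

Box-Cox is a monotonic transformation, so it preserves the ordering of prices while compressing the long right tail.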


This comprehensive preprocessing approach transformed our raw data into a robust, clean, and 
analytically rich dataset. It addresses the complexities inherent in real estate data, such as 
diverse property types, inconsistent neighborhood information, and the need to derive meaningful 
features from raw listings. The resulting dataset provides a solid foundation for our subsequent 
analyses and the development of the Real Estate Alpha Calculator. 


3.3 Overview of Montreal’s Real Estate Market 


Montreal, a vibrant multicultural city of 1.7 million residents, has emerged as a significant 
player in Canada’s real estate market. Known for its rich culture, delicious cuisine, and growing 
reputation as an AI hub, Montreal attracts a diverse population of residents and investors alike. 
As of September 2024, Montreal is considered the most affordable metropolis in Canada, according 
to Canadim [2]. The city is currently a seller’s market, with a sales-to-new-listings ratio 
of 62% in August 2024 [14]. Recent data shows median price increases across various property 
types compared to the previous year: 
• Single-family homes: 5.2% increase 
• Condominiums: 3.6% increase 
• Plexes: 6% increase 


However, with interest rates at 6.45% as of September 2024, significantly higher than the near-zero 
rates seen during the pandemic (2020-2021) [16], many prospective buyers express concerns about 
affordability. 

In the following subsections, we will analyze the Montreal real estate market using our data, 
providing insights to enhance the reader’s understanding of various factors that may impact the 
performance of our Alpha Calculator. 


3.3.1 Neighborhood Analysis 


Price Distribution Across Neighborhoods 
Our analysis reveals distinct patterns in both rental and property sales prices across Montreal’s 
neighborhoods: 


Figure 2: Average rent (a) and average property sales price (b) in each neighborhood. Panel (a) is 
annotated with the highest rent, $1710.25. 


• Saint-Léonard emerges as the most affordable area for both renting and buying. 


• Central areas (e.g., Ville Marie, Griffintown, Outremont) show higher rental prices relative 
to purchase prices, possibly due to the scarcity of single-family homes and plexes in these 
locations, which favors the sale of condos, which tend to be cheaper. 


• Northeastern areas, with limited metro access [17], tend to have lower rental prices but 
higher property purchase prices. 


Note that some of the data in this map is aggregated by larger districts, which may obscure 
nuances in smaller neighborhoods, particularly in areas like Mercier-Hochelaga-Maisonneuve, 
L’Île-des-Sœurs-Verdun and Saint-Michel-Parc-Extension-Villeray-Mile-End. 


Data Frequency by Neighborhood 
The dataset shows varying levels of representation across neighborhoods: 

Figure 3: Data amounts per neighborhood. 

Rental Market: 
• Low representation (< 300 data points): Rivière-des-Prairies, Pointe-Saint-Charles, Anjou, 
Lachine, Saint-Léonard, Verdun, Pierrefonds, L’Île-des-Sœurs, L’Île-Bizard-Sainte-Geneviève 
• High representation: Griffintown, Côte-des-Neiges, downtown area 
Property Sales: 
• Higher representation (> 500 data points): Saint-Léonard, Anjou 
• Lower representation (< 300 data points): Sainte-Geneviève 
• Highest representation: Downtown area, Rosemont, Griffintown, Plateau-Mont-Royal 


This distribution suggests that areas with better public transportation and proximity to downtown 
have more active real estate markets, potentially indicating higher population density or more 
frequent tenant turnover. 

Property Type Distributions 

Properties in our dataset belong to one of the following property types: 'Condo', 'House', 
'Multiplex', 'Lot', 'Commercial'. Figure 4 shows the proportion of the market held by each 
type. As suggested earlier, our intuition was that condos are more common 
in central areas. To verify this assumption, we plotted property types across our data and 
across certain neighborhoods. At first glance, Figure 5 supports our hypothesis. 

To better see the distributions in different areas, the next graph groups the distributions 
by area. 

Another insight from this representation is that condos seem to be less popular 
in areas at the east and west extremes of the island (Pointe-aux-Trembles-Rivière-des-Prairies and 
Pierrefonds-L’Île-Bizard-Pointe-Claire). 


3.3.2 Price Distributions 


Analysis of price distributions reveals that both distributions are approximately normal, but 
property sales show a wider range compared to rentals. 
The rental market appears more accommodating with a tighter price range, suggesting greater 
affordability and flexibility in Montreal’s rental sector compared to property purchases. 

Figure 4: Property types chart. 

Figure 5: Property type distribution per neighborhood. 


3.3.3 Correlation Analysis 


Property Sales 
Most features show low correlation, suggesting little linear dependence among them. Exceptions 
include the pool-related features, and elevator presence with the adapted-mobility feature. 
Rental Market 
Rental data shows stronger correlations between features. 
Notable correlations: 


• Gym and elevator (0.7-0.8): Indicative of multi-story condominium complexes, possibly 
larger, amenity-rich developments with higher-priced units 


• Air conditioning and dishwasher (0.6-0.7): Suggests bundling of amenities by property 
owners 


These correlations provide insights into property characteristics and potential pricing factors 
in the rental market. Additionally, the correlated features in rental data may enhance model 
robustness and performance through information redundancy. 


Figure 7: Price distributions for the rentals and property sales data. 


3.3.4 Market Insights and Trends 


• Montreal remains relatively affordable compared to other Canadian metropolises, despite 
recent price increases. 


• Central areas show higher rental premiums, while some outlying areas offer better value for 
property purchases. 


• Amenities significantly impact rental prices, with gym access and modern appliances 
commanding a premium. 


• The market appears to favor sellers, but high interest rates may impact affordability and 
demand. 


• Areas with good public transit connectivity show more active markets for both rentals and 
sales. 




Figure 8: Heatmaps of correlations between features for (a) Property Sales and (b) Rental data. 


3.3.5 Conclusion 


Montreal’s real estate market presents a complex landscape with opportunities and challenges for 
both investors and residents. While the city remains relatively affordable, recent trends suggest a 
gradual increase in prices across all property types. The disparity between central and peripheral 
areas, coupled with the impact of amenities on pricing, offers diverse investment opportunities. 
However, potential buyers and investors should carefully consider the effects of higher interest 
rates on long-term affordability and market dynamics. 


3.4 Risk 


In the context of real estate investment, calculating risk presents unique challenges due to the 
nature of property transactions and available data. Our approach to computing risk takes into 
account several key factors: 


1. The specificity of the real estate market 
2. Limited data availability (from 2020 onward) 


3. Infrequent sales of individual properties within a short timeframe 


To address these challenges, we’ve developed a novel method for estimating risk that considers 
similar properties within a neighborhood and across the entire city of Montreal. 


3.4.1 Risk Calculation Methodology 


Let A be the property of interest, $G_A$ be a property within the same neighborhood as A, and M 
be a property within the entire city of Montreal. We define: 


• $P_A$: The set of all properties P with features similar to A 
• $M_A$: The set of all properties M with features similar to A, excluding neighborhood 


Since the sample sizes of $M_A$ and $P_A$ differ, directly computing the correlation between them 
is not feasible. To resolve this, we use aligned samples from both distributions, denoted as $M_A^a$ 
and $P_A^a$, where the a index refers to the aligned data. The method for aligning these samples is 
detailed in the following subsection. As data alignment does not affect the calculation of relative 
volatility, the standard deviations for both distributions are computed using the full datasets. 


The risk β is calculated using the following formula: 


$$\beta(A, M) = \mathrm{Corr}(P_A^a, M_A^a) \cdot \frac{\mathrm{standardDeviation}(P_A)}{\mathrm{standardDeviation}(M_A)}$$


3.4.2 Aligning mismatching data counts 


Naturally, as we are using properties similar to the asset for $P_A$, the amount of data fitting those 
criteria is unlikely to match that of the entire market dataset. To address the discrepancy in the 
number of price points between $P_A$ and $M_A$, we developed a temporal grouping technique. Below 
is a simplified explanation; for more details, refer to Algorithm 3 in the appendix: 


1. Group prices for both $P_A$ and $M_A$ by week (since real estate prices appear much more 
stable than those of publicly traded assets, we consider a week to be a good enough time frame for 
a measure of "instant price"). 


2. Pair matching $P_A$ and $M_A$ groups by week, and discard the remaining groups. 


3. For each group, calculate the median (instead of the mean, to avoid sensitivity to 
extremes). 


After this, we are left with aligned samples of $P_A$ and $M_A$, and we are ready to calculate our correlation. 
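
The three steps above can be sketched in plain Python as follows. This is a simplified illustration, not the appendix algorithm: listings are assumed to be (date, price) pairs, weeks are identified by ISO year and week number, and medians are computed before pairing, which yields the same aligned samples as pairing first.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def weekly_medians(listings):
    """Group (date, price) pairs by ISO (year, week) and take each group's median price."""
    groups = defaultdict(list)
    for listing_date, price in listings:
        iso = listing_date.isocalendar()
        groups[(iso[0], iso[1])].append(price)
    return {week: median(prices) for week, prices in groups.items()}


def align_by_week(class_listings, market_listings):
    """Keep only weeks present in both series; return two equally long lists of medians."""
    class_weekly = weekly_medians(class_listings)
    market_weekly = weekly_medians(market_listings)
    shared_weeks = sorted(set(class_weekly) & set(market_weekly))
    return (
        [class_weekly[week] for week in shared_weeks],
        [market_weekly[week] for week in shared_weeks],
    )


# Hypothetical listings: the property class has data for two weeks, the market for three.
class_listings = [
    (date(2021, 1, 4), 400_000), (date(2021, 1, 6), 420_000), (date(2021, 1, 12), 430_000),
]
market_listings = [
    (date(2021, 1, 5), 500_000), (date(2021, 1, 13), 520_000),
    (date(2021, 1, 14), 540_000), (date(2021, 1, 20), 530_000),
]

aligned_class, aligned_market = align_by_week(class_listings, market_listings)
print(aligned_class, aligned_market)
```

The third market week has no counterpart in the property class, so it is discarded, and both aligned samples end up with one median per shared week.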


3.4.3 Risk Normalization 


As previously mentioned, the nature of RE data results in a lower risk investment setting compared 
to the stock market. To make better use of this metric, we propose emphasizing the risk by 
computing a $\beta$ relative to the Montreal real estate market. Specifically, we calculate $\beta$ for each 
unit in our investment landscape, which is divided into {neighborhood; property type} pairs. 
We then normalize the $\beta_A$ of interest using the maximum $\beta_{max}$ and minimum $\beta_{min}$ values found 
in our dataset. The normalization is done as follows: 


\[
\beta_{norm} = \frac{\beta_A - \beta_{min}}{\beta_{max} - \beta_{min}}
\]

To provide insight into the riskiest markets in the City of Montreal based on our data, we have 
computed a risk table for each neighborhood, normalized across property types. The results are 
shown in Figure 9. 

Finally, after normalization, we scale $\beta$ to the range of -1 to 2 to ensure that it captures and 
conveys the complete spectrum of information to $\alpha$. 
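
The normalization and rescaling can be illustrated with a short Python sketch. The function names are hypothetical, and the linear mapping from [0, 1] to [-1, 2] is an assumption: the text gives the target range but not the exact scaling function.

```python
def normalize_beta(beta, beta_min, beta_max):
    """Min-max normalization of a beta to the [0, 1] range."""
    if beta_max == beta_min:
        raise ValueError("beta_max and beta_min must differ")
    return (beta - beta_min) / (beta_max - beta_min)


def scale_beta(beta_norm, low=-1.0, high=2.0):
    """Linearly map a normalized beta from [0, 1] to [low, high] (assumed linear)."""
    return low + (high - low) * beta_norm
```

Under this assumption, the least risky {neighborhood; property type} pair in the dataset maps to -1 and the riskiest maps to 2.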


3.5 Revenue and Price Prediction 


As mentioned earlier, we trained regression models on our scraped data to predict property sales 
prices and rental rates, which we use to estimate the expected return on a property. Once our 
price prediction function, $f_p$, and rental prediction function, $f_r$, are trained, we can generate 
these forecasts. 

For the property sales price, the process is straightforward: we simply predict the price based 
on the features of the property we are evaluating. However, estimating revenues and profits is 
more complex. The user first provides an estimate of the property’s operating costs, and we then 
calculate yearly revenues by estimating monthly rental income and multiplying it by 12. 

Figure 9: Relative market volatility by property type and neighborhood. Red values indicate 
pairs that are missing from our data. 


Since our rental model operates on a per-unit basis, we require unit-specific information to make 
accurate predictions. This is relatively simple for single-family homes or condos but becomes 
more challenging for multi-unit properties, where each unit may vary in terms of the number of 
rooms. 

To address this, we offer two options: 


1. The user enters individual details for each unit (number of rooms, bathrooms, and area in 
square feet). 


2. The user provides no details, and we estimate the units based on our data. 


To produce a realistic unit estimate, we proceed as follows: 


• We start by using rental data from the same neighborhood as the property to ensure relevant 
comparisons. 


• The most common room configurations are identified, and the average room size is calculated 
from nearby properties. 


• We estimate rental income for each room configuration, adjusting for the unit's size and 
weighting more common configurations accordingly. 


• The total revenue for the property is calculated by multiplying the estimated rental income 
per unit by the total number of units. 
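
A minimal Python sketch of the weighting and aggregation steps is given below. It is an illustration only: the function name and the input format are hypothetical, and the predicted monthly rents stand in for the output of the rental model. Each configuration is weighted by how often it occurs in the neighborhood, and the monthly figure is annualized by multiplying by 12.

```python
def estimate_annual_revenue(config_rents, total_units):
    """Estimate yearly rental revenue when no unit details are provided.

    `config_rents` is a list of (count, monthly_rent) pairs, where `count`
    is how often a room configuration appears in the neighborhood data and
    `monthly_rent` is the predicted monthly rent for that configuration.
    """
    total_count = sum(count for count, _ in config_rents)
    if total_count <= 0:
        raise ValueError("need at least one observed configuration")
    # Frequency-weighted average monthly rent for a typical unit.
    rent_per_unit = sum(count * rent for count, rent in config_rents) / total_count
    # Scale to the whole property, then to a full year.
    return rent_per_unit * total_units * 12
```

For example, if three quarters of nearby units rent for 1,000 per month and one quarter for 1,400, the typical unit is estimated at 1,100 per month, and a four-unit property at 52,800 per year.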


Finally, once we have our revenues, operating costs and asset price, we are ready to calculate 
alpha. 


3.6 Alpha Calculation 


To calculate $\alpha$, we use the $\alpha_{RE}$ formula (Equation 4), as outlined in the following algorithm: 


Algorithm 1 Alpha Calculation for Real Estate Investment 

Require: purchase_price, revenues, operating_costs, property, market_data, risk_free_rate 
1: beta ← COMPUTEBETA(property, market_data, prices) 
2: NOI ← revenues - operating_costs 
3: actual_returns ← NOI / purchase_price 
4: expected_returns ← ESTIMATERETURNS(property, operating_costs) 
5: alpha ← actual_returns - risk_free_rate - beta × (expected_returns - risk_free_rate) 
6: return alpha 


4 Results and Discussion 


4.1 Training setup 


Non-Deep Estimators 
Non-deep estimators were implemented using Scikit-Learn, with StandardScaler preprocessing. We 
evaluated Random Forest, SVR, XGBoost, CatBoost, Gradient Boosting, LightGBM, NGBoost, 
Ridge, Lasso, and ElasticNet models. Hyperparameter tuning employed grid search with 5-fold 
cross-validation (see Table 6 in the appendix). Experiments were conducted on an Intel i7-8700 CPU 
(3.20GHz) with 24GB of RAM. Feature selection attempts yielded no improvements. Despite exploring 
various ensembling techniques (stacking, voting, averaging), individual models (RandomForest for 
renting data and Gradient Boosting for property data) consistently outperformed ensemble methods. 

Deep Learning Models 

All DL models were implemented using PyTorch and PyTorch Lightning. We explored ResNet, 
FT-Transformer, and Tab-Transformer architectures. In addition, we explored pretraining our 
Transformer using a masked autoencoder approach. The models were trained with the AdamW 
optimizer and cosine learning rate scheduling for 100 epochs, with a batch size of 32 and an 
initial learning rate of $10^{-4}$. Experiments were conducted on an NVIDIA GeForce GTX 1060 
(3GB) GPU. After training each model, the three models were ensembled using an MLP as a 
meta-learner for stacking. The stacking models were trained for an additional 100 epochs. For 
each dataset, stacking was evaluated in two configurations: one using the output logits from the 
final layer of each model (referred to as DeepStack_Logits) and another using the predicted price 
(referred to as DeepStack_Preds). For detailed hyperparameters, refer to Appendix A. 


4.2 Evaluation 


The evaluation was conducted on two datasets: PropertySalesPrice (27,712 samples, price range 
$9,500 to $7,500,000) and Renting (21,833 samples, price range $280 to $5,990). Both datasets were 
restricted to Montreal and underwent a log1p transformation for training. We employed an 80:20 
train-test split for non-deep estimators and a 70:10:20 train-validation-test split for deep models. 
Performance was assessed using Root Mean Squared Logarithmic Error (RMSLE), defined in 
Equation (7), where $p_i$ and $a_i$ are predicted and actual values, respectively. This metric was 
chosen for its sensitivity to relative errors. The same training regimen was applied consistently 
across both PropertySalesPrice and Renting datasets. 


\[
\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2} \tag{7}
\]
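
For reference, Equation (7) can be computed with the Python standard library as in the sketch below. The function name is hypothetical; `math.log1p(x)` evaluates log(x + 1), matching the log1p transformation mentioned above.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is taken on the log scale, predicting 110 for a true value of 100 contributes roughly the same error as predicting 1,100,000 for a true value of 1,000,000, which is what makes the metric sensitive to relative rather than absolute errors.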

4.3 Experimental Results 

4.3.1 Model Performance 


Table 1 displays performance results, measured using RMSLE, for both Property and Renting 
datasets. The non-deep estimators are listed at the top of the table, while the deep learning 
models are shown at the bottom. 

Table 1: RMSLE Performance of Various Models on Renting and Property Data 


Model               Renting    Property 
----------------------------------------
RandomForest        0.1121*    0.1862 
GBR                 0.1391     0.1851* 
CatBoost            0.1381     0.2060 
LightGBM            0.1412     0.2124 
XGBoost             0.2822     0.2061 
NGBoost             0.2668     0.4487 
Ridge               0.2629     0.4507 
SVR                 0.1758     0.3420 
ElasticNet          0.2723     0.4642 
Lasso               0.3233     0.5155 
----------------------------------------
FT-Transformer      0.1884     0.2594 
Tab-Transformer     0.2367     0.3101 
ResNet              0.2059     0.3048 
DeepStack_Preds     0.1195     0.2263 
DeepStack_Logits    0.1441     0.2345 


Note: Values marked with * indicate the best-performing models (lowest RMSLE) for each dataset. 


4.3.2 Discussion 


The results across different models for the Renting and Property datasets highlight several notable 
patterns in the performance of both deep learning and non-deep learning models for tabular data 
regression tasks. 


Renting Dataset The RandomForest model outperformed all other models on the renting 
dataset, achieving an RMSLE of 0.1121, well ahead of the next-best non-deep estimators, 
CatBoost (0.1381) and Gradient Boosting Regressor (0.1391). This indicates RandomForest's 
strong generalization ability in this context, particularly for tabular data with simpler categorical 
features, such as the predominantly boolean categories in the renting dataset. DeepStack_Preds 
achieved second place across all models, with an RMSLE of 0.1195. 

Among boosting methods, a noticeable gap in performance was observed. While CatBoost, 
LightGBM, and GBR performed well, XGBoost (0.2822) and NGBoost (0.2668) underperformed 
substantially. One possible factor is the tree growth strategy: LightGBM, for example, grows 
trees leaf-wise (best-first), whereas XGBoost grows them level-wise (depth-wise) by default. 
Additionally, models like CatBoost and LightGBM have better native support for handling 
categorical data, which may explain their stronger performance. 

Another observation is that the overall performance on the renting dataset was better compared 
to the property dataset. This may be due to the simpler nature of the categorical features in the 
renting dataset, where most categorical variables are Boolean, unlike the property dataset, which 
contains fields with multiple categories. For instance, the property type feature in the property 
dataset has 9 categories, making optimization more complex. 


Property Dataset On the property dataset, GBR was the best-performing model, with an 
RMSLE of 0.1851, followed closely by RandomForest (0.1862). Most boosting methods (CatBoost, 
LightGBM, GBR) performed similarly, achieving RMSLEs within a narrow range (0.2 ± 0.015). 
However, NGBoost performed poorly, with an RMSLE of 0.4487, indicating its ineffectiveness in 
this context. 


Across Datasets A consistent observation across both datasets is the underperformance of 
Lasso regression, which recorded the worst results, with an RMSLE of 0.3233 on the renting 
dataset and 0.5155 on the property dataset. This highlights Lasso’s limitations for regression tasks 
involving complex tabular data, where interactions between features are essential for accurate 
predictions. 

Deep learning models demonstrated competitive performance compared to traditional estimators 
but did not outperform the best non-deep models. Despite extensive hyperparameter tuning and 
significantly higher computational resource requirements, deep learning models could not close 
the gap with the best-performing classical models such as RandomForest and GBR. This suggests 
that, for these types of tabular datasets, deep learning may not yet be ready to replace classical 
models, especially when considering the disproportionate effort required for training. 

However, stacking methods did show promise when applied to the deep learning models. Unlike 
the ensembling attempts with classical models, which did not lead to improvements, stacking 
consistently enhanced the performance of deep learning models. For both the renting and property 
datasets, stacking using the model predictions (as opposed to logits) yielded the best results. 
Among the individual deep learning architectures tested, FT-Transformer was the strongest 
on these datasets, achieving an RMSLE of 0.1884 on the renting dataset and 0.2594 on the 
property dataset. While it still lags behind the best classical models, FT-Transformer stands 
out as the most promising deep learning model in this domain. 


Final Remarks Overall, the findings underscore the effectiveness of traditional models like 
RandomForest and GBR for tabular data regression, particularly for datasets with complex 
categorical features, such as the property dataset. While deep learning models show potential, 
their practical value remains limited by the significant effort required for training and tuning. 
Further advancements in deep learning architectures and optimization techniques may be necessary 
before these models can consistently outperform classical methods on tabular data. 


4.4 Real Estate Alpha Calculator Demonstration 


4.4.1 Context 


Imagine you are an investor looking for a quadruplex in the city. You are considering two 
properties located in Downtown and Saint-Michel (see Table 2). 


Table 2: Potential Quadruplexes Comparison Table 


Price Neighborhood Property Type Revenue Operating Costs 
$860,000 Saint-Michel Quadruplex $38,000 $3,926 
$875,000 Downtown Quadruplex $67,200 $5,951 


Note: Examples taken from real data available on DuProprio. 


After entering the details of each deal into our calculator, we obtained the following outputs (see 
Table 3). 


Table 3: Alpha Calculation Output 


Neighborhood        Downtown    Saint-Michel 

Expected Returns    3.769%      4.438% 
Actual Returns      7%          3.962% 
Risk-Free Rate      4.72%       4.72% 
Beta                0.3684      1.464 
Alpha               2.63%       -0.3452% 


Now let's break down the $\alpha_{RE}$ formula (see Equation 4) to understand why these two properties 
have the alpha values they do. 

Input field                            (a) Quadruplex in Downtown    (b) Quadruplex in Saint-Michel 
Has Rental Unit Details                No                            No 
Risk Free Rate                         4.72                          4.72 
Neighborhood                           Downtown                      Saint-Michel 
Source                                 DuProprio                     DuProprio 
Property Type                          Quadruplex                    Quadruplex 
Purchase Price                         875000 $                      860000 $ 
Potential Gross Revenues (Annual)      67200 $                       38000 $ 
Operating Costs (Annual)               5951 $                        3926 $ 
Total Units                            4                             4 
Residential Units                      4                             4 
Business Units                         0                             0 
Number of Parkings                     0                             2 
Parking Type                           None                          Double drive Garage 
Construction Date                      1885                          1959 
Construction Status                    Century                       Normal 


Figure 10: Alpha calculator input examples for both quadruplexes. 


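\begin{align*}
\alpha_{RE} &= \text{Actual Returns} - [R_f + \beta \times (\text{Expected Returns} - R_f)] \\
\alpha_{\text{Saint-Michel}} &= 3.962 - 4.72 - 1.464 \times (4.438 - 4.72) \\
&= -0.3452 \\
\alpha_{\text{Downtown}} &= 7 - 4.72 - 0.3684 \times (3.769 - 4.72) \\
&= 2.63
\end{align*}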
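
The arithmetic above can be checked with a few lines of Python. This is a sketch with hypothetical function names; all rates are expressed in percent, and the inputs are the values reported in Tables 2 and 3.

```python
def actual_returns_pct(purchase_price, revenues, operating_costs):
    """Net operating income as a percentage of the purchase price."""
    noi = revenues - operating_costs
    return 100.0 * noi / purchase_price


def compute_alpha(actual_returns, expected_returns, beta, risk_free_rate):
    """alpha = actual - [risk_free + beta * (expected - risk_free)], in percent."""
    return actual_returns - (risk_free_rate + beta * (expected_returns - risk_free_rate))


RISK_FREE_RATE = 4.72  # percent, as reported in Table 3

# Actual returns recomputed from the prices, revenues and costs in Table 2.
downtown_actual = actual_returns_pct(875_000, 67_200, 5_951)
saint_michel_actual = actual_returns_pct(860_000, 38_000, 3_926)

# Alpha computed from the rounded values reported in Table 3.
alpha_downtown = compute_alpha(7.0, 3.769, 0.3684, RISK_FREE_RATE)
alpha_saint_michel = compute_alpha(3.962, 4.438, 1.464, RISK_FREE_RATE)

print(f"Downtown:     actual {downtown_actual:.3f}%, alpha {alpha_downtown:.4f}%")
print(f"Saint-Michel: actual {saint_michel_actual:.3f}%, alpha {alpha_saint_michel:.4f}%")
```

Note that the alpha values in Table 3 are reproduced when the rounded actual returns (7% and 3.962%) are used; starting from the unrounded returns changes the last reported digit slightly.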


Even though both properties are of the same type, the one in Downtown has a much higher $\alpha$ 
than its counterpart, despite a $15,000 higher price. There are two main reasons for this: 


• The Downtown property yields significantly higher revenues relative to its purchase price 
and operating costs, with a 7% return compared to 3.962%. 


• The market for quadruplexes in Saint-Michel is riskier than that in Downtown (with a $\beta$ of 
1.464 vs. 0.3684) (see Figure 9). 


To improve the alpha for the Saint-Michel property, the investor should look into reducing 
operating costs and increasing rental income. 


5 Conclusion 


This paper introduced the Real Estate Alpha Calculator, a novel tool designed to streamline 
property investment screening in Montreal’s real estate market. By adapting the Capital Asset 
Pricing Model (CAPM) to real estate investments and leveraging machine learning for price 
predictions, we developed a systematic approach to quantifying potential returns while accounting 
for systemic risks. 

Our methodology encompassed the development of a real estate beta calculation, comprehensive 
neighborhood profiling, and the implementation of various machine learning models for property 
sales and renting price prediction. The empirical results demonstrated that traditional machine 
learning approaches outperformed more complex deep learning models, with Random Forest 
achieving optimal performance for rental properties (RMSLE 0.1121) and Gradient Boosting 
Regressor excelling for property values (RMSLE 0.1851). The analysis revealed that the relative 
simplicity of rental data categories contributed to more accurate predictions compared to the 
complex property dataset. 

While the Real Estate Alpha Calculator represents a significant advancement in quantitative real 
estate investment analysis, several limitations should be acknowledged. The tool does not account 
for potential property appreciation over time, and predictions for certain areas are constrained by 
limited data availability. Additionally, some property-specific risks may not be fully captured by 
the systemic risk assessment. Despite these constraints, the calculator provides investors with a 
valuable, data-driven approach to initial property screening, effectively bridging the gap between 
financial modeling and practical real estate investment decisions. 

Appendices 


A Appendix: Implementation Details 


Here are implementation details for the machine learning models. 


Table 4: Model Architectures and Parameters 


Model             Dim    Depth    Heads    Dim_head    Params (M) 
FT-Transformer    192    6        4        192         6.4 
Tab-Transformer   192    6        4        192         10.4 
ResNet            256    8        NA       NA          2.6 
DeepStack         256    3        NA       NA          X.X 


Table 5: Torch Training Hyperparameters for Grid Search 


Hyperparameter                  Values 
Optimizer                       AdamW 
Epochs                          100 
Learning Rate (lr)              {1e-5, 1e-4, 1e-3} 
Weight Decay                    {1e-5, 1e-4} 
Scheduler                       CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2) 
Batch Size                      32 
Optimizer Momentum (β1, β2)     (0.9, 0.999) 


Algorithm 2 Beta Calculation 

1: property_prices    ▷ Prices paired with their observed time for a given property type & neighborhood 
2: market_prices    ▷ Prices paired with their observed dates for a given property type 
3: property_std = standard_deviation(property_prices) 
4: market_std = standard_deviation(market_prices) 
5: correlation = calculate_weekly_correlation(property_prices, market_prices) 
6: beta = correlation × (property_std / market_std) 
7: return beta 


Algorithm 3 Weekly Correlation Calculation 

1: property['creation_date'] = convert_to_datetime(property['creation_date']) 
2: market['creation_date'] = convert_to_datetime(market['creation_date']) 
3: property['year_week'] = extract_iso_week(property['creation_date']) 
4: property['year'] = extract_year(property['creation_date']) 
5: property['year_week'] = format_year_week(property['year'], property['year_week']) 
6: market['year_week'] = extract_iso_week(market['creation_date']) 
7: market['year'] = extract_year(market['creation_date']) 
8: market['year_week'] = format_year_week(market['year'], market['year_week']) 
9: property_grouped = group_by(property, 'year_week').calculate_median('price') 
10: market_grouped = group_by(market, 'year_week').calculate_median('price') 
11: matching_weeks = intersect(property_grouped.index, market_grouped.index) 
12: property_medians = [property_grouped[week] for week in matching_weeks] 
13: market_medians = [market_grouped[week] for week in matching_weeks] 
14: if len(property_medians) > 1 and len(market_medians) > 1 then 
15:     correlation = calculate_correlation(property_medians, market_medians) 
16: else 
17:     correlation = None    ▷ Not enough data for correlation 
18: end if 
19: return correlation 


Table 6: Grid Search Hyperparameters for Various Models 


Model                        Hyperparameter         Values 
RandomForestRegressor        n_estimators           50, 300 
                             max_depth              3, 20 
                             min_samples_split      2, 10 
                             min_samples_leaf       1, 4 
                             max_features           0.1, 1.0 
SVR                          C                      0.1, 10 
                             gamma                  0.01, 1 
                             epsilon                0.1, 0.5 
XGBRegressor                 n_estimators           50, 100, 200, 500 
                             max_depth              3, 5, 7, 10 
                             learning_rate          0.01, 0.1, 0.2, 0.3 
                             subsample              0.5, 0.8, 0.9, 1.0 
                             colsample_bytree       0.3, 0.7, 1.0 
CatBoostRegressor            iterations             50, 300 
                             depth                  4, 10 
                             learning_rate          0.01, 0.3 
                             l2_leaf_reg            1, 5 
                             bagging_temperature    0.0, 1.0 
GradientBoostingRegressor    n_estimators           50, 300 
                             max_depth              3, 10 
                             learning_rate          0.01, 0.3 
                             subsample              0.5, 1.0 
                             max_features           0.1, 1.0 
LGBMRegressor                n_estimators           50, 300 
                             max_depth              -1, 20 
                             learning_rate          0.01, 0.3 
                             num_leaves             20, 100 
                             subsample              0.5, 1.0 
NGBRegressor                 n_estimators           50, 300 
                             learning_rate          0.01, 0.3 
                             minibatch_frac         0.5, 1.0 
                             col_sample             0.5, 1.0 
                             max_depth              3, 10 
Ridge                        alpha                  0.1, 1.0, 10.0, 100.0 
                             solver                 auto, svd, cholesky, lsqr, sparse_cg, sag, saga 
Lasso                        alpha                  0.1, 100.0 
                             max_iter               1000, 5000 
                             tol                    0.0001, 0.01 
ElasticNet                   alpha                  0.1, 1.0, 10.0, 100.0 
                             l1_ratio               0.1, 0.5, 0.9 
                             max_iter               1000, 5000 
                             tol                    0.0001, 0.001, 0.01 


